Before you begin, you should install all of the relevent R packages. You can download our packages as follows…
library(cytomapper)
library(dplyr)
library(ggplot2)
library(simpleSeg)
library(FuseSOM)
library(ggpubr)
library(scater)
library(spicyR)
library(ClassifyR)
library(scFeatures)
library(lisaClust)
It is convenient to set the number of cores for running code in parallel. Please chose a number that is appropriate for your resources.
nCores <- 20
BPPARAM <- simpleSeg:::generateBPParam(nCores)
theme_set(theme_classic())
In the following we will reanalyse some MIBI-TOF data (Risom et al, 2022) profiling the spatial landscape of ductal carcinoma in situ (DCIS), which is a pre-invasive lesion that is thought to be a precursor to invasive breast cancer (IBC). The key conclusion of this manuscript (amongst others) is that spatial information about cells can be used to predict disease progression in patients. We will use our spicy workflow to make a similar conclusion.
The R code for this analysis is available on github https://github.com/SydneyBioX/spicyWorkflow. A mildly processed version of the data used in the manuscript is available in this repository.
The images are stored in the images folder within the Data folder. Here we use readImages() from the EBImage package to read these into R. If memory is a restricting factor, and the files are in a slightly different format, you could use loadImages() from the cytomapper package to load all of the tiff images into a CytoImageList object, which can store the images as h5 on-disk.
pathToImages <- "Data/images"
# Get directories of images
imageDirs <- dir(pathToImages, full.names = TRUE)
names(imageDirs) <- dir(pathToImages, full.names = FALSE)
# Get files in each directory
files <- sapply(imageDirs, list.files, pattern = "tif", full.names = TRUE, simplify = FALSE)
# Read files with readImage from EBImage
images <- lapply(files, EBImage::readImage, as.is = TRUE)
We will make use of the on_disk option to convert our images to a CytoImageList with the images not held in memory.
# Store images in a CytoImageList with images on_disk as h5 files to save memory.
dir.create("Data/h5Files")
images <- cytomapper::CytoImageList(images,
on_disk = TRUE,
h5FilesPath = "Data/h5Files",
BPPARAM = BPPARAM)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 15998577 854.5 31235335 1668.2 19353840 1033.7
## Vcells 26137571 199.5 2824080636 21546.1 3530100794 26932.6
To associate features in our image with disease progression, it is important to read in information which links image identifiers to their progression status. We will do this here, making sure that our imageID match.
## Read the clinical data
# Read in clinical data, manipulate imageID and select columns
clinical <- read.csv("Data/1-s2.0-S0092867421014860-mmc1.csv")
clinical <- clinical |>
mutate(imageID = paste0("Point", PointNumber, "_pt", Patient_ID, "_", TMAD_Patient))
clinical$imageID[grep("normal", clinical$Tissue_Type)] <- paste0(clinical$imageID[grep("normal", clinical$Tissue_Type)], "_Normal")
clinicalVariables <- c("imageID", "Patient_ID","Status", "Age", "SUBTYPE", "PAM50", "Treatment", "DCIS_grade", "Necrosis")
rownames(clinical) <- clinical$imageID
We can then store the clinical information in the mcols of the CytoImageList.
# Add the clinical data to mcols of images.
mcols(images) <- clinical[names(images), clinicalVariables]
Our simpleSeg R package on https://github.com/SydneyBioX/simpleSeg provides a series of functions to generate simple segmentation masks of images. These functions leverage the functionality of the EBImage package on Bioconductor. For more flexibility when performing your segmentation in R we recommend learning to use the EBimage package. A key strength of the simpleSeg package is that we have coded multiple ways to perform some simple segmentation operations as well as incorporating multiple automatic procedures to optimise some key parameters when these aren’t specified.
If your images are stored in a list or CytoImageList they can be segmented with a simple call to simpleSeg(). Here we have ask simpleSeg to do multiple things. First, we would like to use a combination of principal component analysis of all channels guided by the H33 channel to summarised the nuclei signal in the images. Secondly, to estimate the cell body of the cells we will simply dilate out from the nuclei by 2 pixels. We have also requested that the channels be square root transformed and that a minimum cell size of 40 pixels be used as a size selection step.
# Generate segmentation masks
masks <- simpleSeg(images,
nucleus = c("PCA", "HH3"),
cellBody = "dilate",
transform = "sqrt",
sizeSelection = 40,
discSize = 2,
cores = nCores)
The display and colorLabels functions in EBImage make it very easy to examine the performance of the cell segmentation. The great thing about display is that if used in an interactive session it is very easy to zoom in and out of the image.
# Visualise segmentation performance one way.
EBImage::display(colorLabels(masks[[1]]))
The plotPixels function in cytomapper make it easy to overlay the masks on top of the intensities of 6 markers. Here we can see that the segmentation appears to be performing reasonably.
# Visualise segmentation performance another way.
cytomapper::plotPixels(image = images[1],
mask = masks[1],
img_id = "imageID",
colour_by = c("PanKRT", "GLUT1", "HH3", "CD3", "CD20"),
display = "single",
colour = list(HH3 = c("black","blue"),
CD3 = c("black","purple"),
CD20 = c("black","green"),
GLUT1 = c("black", "red"),
PanKRT = c("black", "yellow")),
bcg = list(HH3 = c(0, 1, 1.5),
CD3 = c(0, 1, 1.5),
CD20 = c(0, 1, 1.5),
GLUT1 = c(0, 1, 1.5),
PanKRT = c(0, 1, 1.5)),
legend = NULL)
In order to charactise the phenotypes of each of the segmented cells, measureObjects from cytomapper will calculate the average intensity of each channel within each cell as well as a few morphological features. The channel intensities will be stored in the counts assay in a SingleCellExperiment. Information on the spatial location of each cell is stored in colData in the m.cx and m.cy columns. In addition to this, it will propogate the information we have store in the mcols of our CytoImageList in the colData of the resulting SingleCellExperiment.
# Summarise the experssion of each marker in each cell
cells <- cytomapper::measureObjects(masks,
images,
img_id = "imageID",
BPPARAM = BPPARAM)
We should check to see if the marker intensities of each cell require some form of transformation or normalisation. Here we extract the intensities from the counts assay. Looking at PanKRT which should be expressed in the majority of the tumour cells, the intensities are clearly very skewed.
# Extract marker data and bind with information about images
df <- as.data.frame(cbind(colData(cells), t(assay(cells, "counts"))))
# Plots densities of PanKRT for each image.
ggplot(df, aes(x = PanKRT, colour = imageID)) +
geom_density() +
theme(legend.position = "none")
We can transform and normalise our data using the normalizeCells function. Here we have taken the intensities from the counts assay, performed a square root transform, then for each image trimmed the 99 quantile and min-max scaled to 0-1. This modified data is then store in the norm assay by default. We can see that this normalised data appears more bimodal, not perfect, but likely sufficient for clustering.
# Transform and normalise the marker expression of each cell type.
# Use a square root transform, then trimmed the 99 quantile
cells <- normalizeCells(cells,
transformation = "asinh",
method = c("trim99", "minMax"),
assayIn = "counts",
cores = nCores)
# Extract normalised marker information.
df <- as.data.frame(cbind(colData(cells), t(assay(cells, "norm"))))
# Plots densities of normalised PanKRT for each image.
ggplot(df, aes(x = CD68, colour = imageID)) +
geom_density() +
theme(legend.position = "none")
Our FuseSOM R package on https://github.com/ecool50/FuseSOM and provides a pipeline for the clustering of highly multiplexed in situ imaging cytometry assays. This pipeline uses the Self Organizing Map architecture coupled with Multiview hierarchical clustering and provides functions for the estimation of the number of clusters.
Here we cluster using the runFuseSOM function. We have chosen to specify the same subset of markers used in the original manuscript for gating cell types. We have also specified the number of clusters to identify to be 20.
cells@metadata <- list()
# The markers used in the original publication to gate cell types.
useMarkers <- c("PanKRT", "ECAD", "CK7", "VIM", "FAP", "CD31", "CK5", "SMA",
"CD45", "CD4", "CD3", "CD8", "CD20", "CD68", "CD14", "CD11c",
"HLADRDPDQ", "MPO", "Tryptase")
# Set seed.
set.seed(51773)
# Generate SOM and cluster cells into 20 groups.
cells <- runFuseSOM(cells,
markers = useMarkers,
assay = 'norm',
numClusters = 20)
We can begin the process of understanding what each of these cell clusters are by using the plotGroupedHeatmap function from scater. At the least, here we can see we capture all of the major immune populations that we expect to see.
# Visualise marker expression in each cluster.
scater::plotGroupedHeatmap(cells,
features = useMarkers,
group = "clusters",
exprs_values = "norm",
center = TRUE,
scale = TRUE,
zlim = c(-3,3),
cluster_rows = FALSE)
We can check to see how reasonable our choice of 20 clusters is using the estimateNumCluster and the optiPlot functions. Here we examine the Gap method, others such as Silhouette and Within Cluster Distance are also available.
# Generate metrics for estimating the number of clusters.
# As I've already run runFuseSOM I don't need to run generateSOM().
cells <- estimateNumCluster(cells, kSeq = 2:30)
optiPlot(cells, method = "gap")
# Check cluster frequencies.
colData(cells)$clusters |>
table() |>
sort()
##
## cluster_20 cluster_14 cluster_18 cluster_9 cluster_19 cluster_16 cluster_12
## 302 419 550 714 716 934 1060
## cluster_10 cluster_13 cluster_11 cluster_4 cluster_15 cluster_17 cluster_1
## 1110 1114 1130 1135 1159 1232 1615
## cluster_7 cluster_5 cluster_6 cluster_2 cluster_3 cluster_8
## 2819 3611 4959 7930 15677 20886
We recommend using a package such as diffcyt for testing for changes in abundance of cell types. However, the colTest function allows us to quickly test for associations between the proportions of the cell types and progression status using either wilcoxon rank sum tests or t-tests. Here we see a p-value less than 0.05 but this does not equate to a small fdr.
# Select cells which belong to individuals with progressor status.
cellsToUse <- cells$Status%in%c("nonprogressor", "progressor")
# Perform simple wicoxon rank sum tests on the columns of the proportion matrix.
testProp <- colTest(cells[, cellsToUse],
condition = "Status",
feature = "clusters")
testProp
## W pval adjPval cluster
## cluster_5 180 0.022 0.44 cluster_5
## cluster_20 210 0.063 0.63 cluster_20
## cluster_13 390 0.150 0.75 cluster_13
## cluster_6 390 0.150 0.75 cluster_6
## cluster_12 370 0.250 0.82 cluster_12
## cluster_10 360 0.330 0.82 cluster_10
## cluster_15 360 0.330 0.82 cluster_15
## cluster_9 260 0.420 0.82 cluster_9
## cluster_7 270 0.460 0.82 cluster_7
## cluster_11 340 0.510 0.82 cluster_11
## cluster_2 340 0.510 0.82 cluster_2
## cluster_1 280 0.620 0.82 cluster_1
## cluster_17 280 0.640 0.82 cluster_17
## cluster_3 280 0.650 0.82 cluster_3
## cluster_18 330 0.670 0.82 cluster_18
## cluster_8 330 0.700 0.82 cluster_8
## cluster_14 290 0.720 0.82 cluster_14
## cluster_19 320 0.760 0.82 cluster_19
## cluster_4 290 0.780 0.82 cluster_4
## cluster_16 300 0.820 0.82 cluster_16
imagesToUse <- rownames(clinical)[clinical[, "Status"]%in%c("nonprogressor", "progressor")]
prop <- getProp(cells, feature = "clusters")
clusterToUse <- rownames(testProp)[1]
boxplot( prop[imagesToUse, clusterToUse] ~ clinical[imagesToUse, "Status"] )
As our data is stored in a SingleCellExperiment we can also use scater to perform and visualise our data in a lower dimension to look for image or cluster differences.
set.seed(51773)
# Perform dimension reduction using UMP.
cells <- scater::runUMAP(cells,
subset_row = useMarkers,
exprs_values = "norm")
# Select a subset of images to plot.
someImages <- unique(colData(cells)$imageID)[c(1,10,20,40,50,60)]
# UMAP by imageID.
scater::plotReducedDim(cells[,colData(cells)$imageID %in% someImages], dimred="UMAP", colour_by="imageID")
# UMAP by cell type cluster.
scater::plotReducedDim(cells[,colData(cells)$imageID %in% someImages], dimred="UMAP", colour_by="clusters")
Our spicyR package (https://www.bioconductor.org/packages/devel/bioc/html/spicyR.html)[https://www.bioconductor.org/packages/devel/bioc/html/spicyR.html] provides a series of functions to aid in the analysis of both immunofluorescence and mass cytometry imaging data as well as other assays that can deeply phenotype individual cells and their spatial location. Here we use the spicy function to test for changes in the spatial relationships between pairwise combinations of cells. We quantify spatal relationships using a combination of three radii Rs = c(20, 50, 100) and mildy account for some of the global tissue structure using sigma = 50.
# Test for changes in pairwise spatial relationships between cell types.
spicyTest <- spicy(cells[, cellsToUse],
condition = "Status",
cellType = "clusters",
imageID = "imageID",
spatialCoords = c("m.cx", "m.cy"),
Rs = c(20, 50, 100),
sigma = 50,
BPPARAM = BPPARAM)
topPairs(spicyTest, n = 10)
## intercept coefficient p.value adj.pvalue from
## cluster_8__cluster_4 -78.88676 74.33513 0.002936530 0.4226348 cluster_8
## cluster_4__cluster_8 -80.52218 75.32697 0.003337986 0.4226348 cluster_4
## cluster_4__cluster_3 -89.90533 56.55626 0.004065831 0.4226348 cluster_4
## cluster_1__cluster_5 -113.69861 69.52234 0.006175262 0.4226348 cluster_1
## cluster_5__cluster_9 -115.73862 58.09037 0.006612725 0.4226348 cluster_5
## cluster_8__cluster_19 -93.36838 82.84290 0.007599857 0.4226348 cluster_8
## cluster_3__cluster_4 -83.76515 52.04157 0.008722023 0.4226348 cluster_3
## cluster_19__cluster_8 -95.80017 79.99067 0.009515043 0.4226348 cluster_19
## cluster_20__cluster_10 -126.56173 107.57596 0.011209215 0.4226348 cluster_20
## cluster_9__cluster_5 -121.19759 48.68111 0.012894841 0.4226348 cluster_9
## to
## cluster_8__cluster_4 cluster_4
## cluster_4__cluster_8 cluster_8
## cluster_4__cluster_3 cluster_3
## cluster_1__cluster_5 cluster_5
## cluster_5__cluster_9 cluster_9
## cluster_8__cluster_19 cluster_19
## cluster_3__cluster_4 cluster_4
## cluster_19__cluster_8 cluster_8
## cluster_20__cluster_10 cluster_10
## cluster_9__cluster_5 cluster_5
We can visualise these tests using signifPlot where we observe that cell type pairs appear to become less attractive (or avoid more) in the progression sampls.
# Visualise which relationships are changing the most.
signifPlot(spicyTest,
breaks = c(-1.5, 3, 0.5))
Our lisaClust package (https://www.bioconductor.org/packages/devel/bioc/html/lisaClust.html)[https://www.bioconductor.org/packages/devel/bioc/html/lisaClust.html] provides a series of functions to identify and visualise regions of tissue where spatial associations between cell-types is similar. This package can be used to provide a high-level summary of cell-type colocalization in multiplexed imaging data that has been segmented at a single-cell resolution. Here we use the lisaClust function to clusters cells into 5 regions with distinct spatial ordering.
set.seed(51773)
# Cluster cells into spatial regions with similar composition.
cells <- lisaClust(cells,
k = 5,
Rs = c(20, 50, 100),
sigma = 50,
spatialCoords = c("m.cx", "m.cy"),
cellType = "clusters",
BPPARAM = BPPARAM)
We can try to interpret which spatial orderings the regions are quantifying using the regionMap function. This plots the frequency of each cell type in a region relative to what you would expect by chance.
# Visualise the enrichment of each cell type in each region
regionMap(cells, cellType = "clusters", limit = c(0.2, 5))
By default, these identified regions are stored in the regions column in the colData of our object. We can quickly examine the spatial arrangement of these regions using ggplot.
# Extract cell information and filter to specific image.
df <- colData(cells) |>
as.data.frame() |>
filter(imageID == "Point2206_pt1116_31620")
# Colour cells by their region.
ggplot(df, aes(x = m.cx, y = m.cy, colour = region)) +
geom_point()
While much slower, we have also implemented a function for overlaying the region information as a hatching pattern so that the information can be viewed simultaneously with the cell type calls.
# Use hatching to visualise regions and cell types.
hatchingPlot(cells,
useImages = "Point2206_pt1116_31620",
cellType = "clusters",
spatialCoords = c("m.cx", "m.cy")
)
This plot is a ggplot object and so the scale can be modified with scale_region_manual.
# Use hatching to visualise regions and cell types.
# Relabel the hatching of the regions.
hatchingPlot(cells,
useImages = "Point2206_pt1116_31620",
cellType = "clusters",
spatialCoords = c("m.cx", "m.cy"),
window = "square",
nbp = 300,
line.spacing = 41) +
scale_region_manual(values = c(region_1 = 2,
region_2 = 1,
region_3 = 5,
region_4 = 4,
region_5 = 3)) +
guides(colour = guide_legend(ncol = 2))
If needed, we can again quickly use the colTest function to test for associations between the proportions of the cells in each region and progression status using either wilcoxon rank sum tests or t-tests. Here we see a adjusted p-value less than 0.05.
# Test if the proportion of each region is associated
# with progression status.
testRegion <- colTest(cells[,cellsToUse],
feature = "region",
condition = "Status")
testRegion
## W pval adjPval cluster
## region_5 240 0.21 0.84 region_5
## region_2 360 0.40 0.84 region_2
## region_1 340 0.58 0.84 region_1
## region_4 330 0.70 0.84 region_4
## region_3 320 0.84 0.84 region_3
scFeatures is an R package available on https://github.com/SydneyBioX/scFeatures that generates multi-view representations of single-cell and spatial data through the construction of a total of 17 feature types. Here we use it to quantify the proportions of each cell type in each image as well as the average expression of each marker on each cell type. scFeatures outpus a list of two data.frames in this case.
# Use scFeatures to calculate proportions and the average marker abundance
# for each cell type.
data <- scFeatures(cells,
feature_types = c("proportion_raw", "gene_mean_celltype"),
sample = "imageID",
celltype = "clusters",
assay = "norm",
ncores = nCores )
## [1] "generating proportion raw features"
## [1] "generating gene mean celltype features"
names(data) <- c("prop", "mean")
imagesToUse <- rownames(clinical)[clinical[, "Status"]%in%c("nonprogressor", "progressor")]
# Test each marker-celltype for it's association with progression.
test <- colTest(data$mean[imagesToUse,],
condition = clinical[imagesToUse, "Status"])
test |> head()
## W pval adjPval cluster
## cluster_5--CD8 120 2.1e-06 0.0017 cluster_5--CD8
## cluster_2--CD3 120 2.1e-05 0.0086 cluster_2--CD3
## cluster_1--ECAD 130 8.2e-04 0.2200 cluster_1--ECAD
## cluster_5--CD3 160 1.5e-03 0.2400 cluster_5--CD3
## cluster_9--ECAD 170 1.8e-03 0.2400 cluster_9--ECAD
## cluster_4--P 480 1.8e-03 0.2400 cluster_4--P
Our ClassifyR package, https://github.com/SydneyBioX/ClassifyR, formalises a convenient framework for evaulating classification in R. We provide functionaility to easily include four key modelling stages; Data transformation, feature selection, classifier training and prediction; into a cross-validation loop. Here we use the crossValidate function to perform 100 repeats of 5-fold cross-validation to evaluate the performance of an elastic net model applied to three quanitifications of our MIBI-TOF data; cell type proportions, average mean of each cell type and region proportions.
# Add proportions of each region in each image
# to the list of dataframes.
data[["regions"]] <- getProp(cells, "region")
# Subset data images with progression status
measurements <- lapply(data, function(x)x[imagesToUse, ])
# Set seed
set.seed(51773)
# Perform cross-validation of an elastic net model
# with 100 repeats of 5-fold cross-validation.
cv <- crossValidate(measurements = measurements,
outcome = clinical[imagesToUse, "Status"],
classifier = "elasticNetGLM",
nFolds = 5,
nRepeats = 100,
nCores = nCores
)
Here we use the performancePlot function to assess the AUC from each repeat of the 5-fold cross-validation. We see that the lisaClust regions appear to capture information which is predictive of progression status of the patients.
# Calculate AUC for each cross-validation repeat and plot.
performancePlot(cv,
performanceName = "AUC",
characteristicsList = list(x = "Assay Name"))
Here we have used a pipeline of our spatial analysis R packages to demonstrate an easy way to segment, cluster, normalise, quantify and classify high dimensional in situ cytometry data all within R.